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Abstract In application of the Balancing Domain Decomposition by Constraints 
(BDDC) to a case with many substructures, solving the coarse problem exactly be- 
comes the bottleneck which spoils scalability of the solver. However, it is straight- 
forward for BDDC to substitute the exact solution of the coarse problem by another 
step of BDDC method with subdomains playing the role of elements. In this way, 
the algorithm of three-level BDDC method is obtained. If this approach is applied 
recursively, multilevel BDDC method is derived. We present a detailed description 
of a recently developed parallel implementation of this algorithm. The implemen- 
tation is applied to an engineering problem of linear elasticity and a benchmark 
problem of Stokes flow in a cavity. Results by the multilevel approach are compared 
to those by the standard (two-level) BDDC method. 
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The Balancing Domain Decomposition by Constraints (BDDC) method introduced 
in |2| is one of the most advanced methods of iterative substructuring for the so- 
lution of large systems of linear algebraic equations arising from discretization of 
boundary value problems. However, in the case of many substructures, solving the 
coarse problem exactly becomes the limiting factor for scalability of the otherwise 
perfectly parallel algorithm. This has been observed also for the FETI-DP method 
(e.g. in fljj), which is closely related to BDDC. For this reason, recent research in 
the area is directed towards inexact solutions of the coarse problem. For example, 
algebraic multigrid is used in [ 3 1 to obtain an approximate coarse correction within 
the FETI-DP method, and excellent scalability is achieved. 

We follow a different approach in this contribution. As was mentioned already 
in 1 2 1, it is quite straightforward for BDDC to substitute the exact solution of the 
coarse problem by another step of the BDDC method with subdomains playing the 
role of elements. In this way, the algorithm of three-level BDDC method is obtained 
(studied in JSJ). If this step is repeated recursively, one arrives at the multilevel 
BDDC method (introduced in [4] without a parallel implementation). Unlike for 
most other domain decomposition methods, such extension is natural for BDDC, 
since the coarse problem has the same structure as the original problem. Although 
the mathematical theory in [4] suggests worsening of the efficiency of the multi- 
level BDDC preconditioner with each additional level, the resulting algorithm may 
outperform the standard method with respect to computational time due to better 
scalability. This fact makes the algorithm a good candidate for using on future mas- 
sively parallel systems. 

In this paper, we present a recently developed parallel implementation of multi- 
level BDDC method. It is applied to an engineering problem of linear elasticity and 
a benchmark problem of Stokes flow in a cavity. The results suggest which draw- 
backs of the two-level implementation might be overcome by the extension to more 
levels. Our solver library has been released as an open-source package. 



2 BDDC preconditioner with two and more levels 

The starting point for BDDC is the reduced interface problem Su = g, where S is 
the Schur complement with respect to interface, i.e. unknowns shared by more than 
one subdomain, u is the part of vector of coefficients of finite element basis functions 
at the interface, and g is sometimes called condensed right hand side. This problem 
is solved by a Rrylov subspace method in the framework of iterative substructuring. 
Within these methods, application of S to a vector is realized by parallel solution of 
independent discrete Dirichlet problems. In this way, the costly explicit construction 
of the Schur complement is avoided. However, since it is not the main concern of 
this contribution, the reader is referred to paper (5), or monograph Q for details of 
iterative substructuring. 
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In what follows, we turn our attention towards the second key part of Rrylov 
subspace methods - the preconditioner, which is realized by one step of the BDDC 
method. Let us begin with description of the standard (two-level) version of BDDC. 
Let K,- be the local subdomain matrix, obtained by the sub-assembling of element 
matrices of elements contained in z'-th subdomain. We introduce the coarse space 
basis functions on each subdomain represented by columns of matrix ^P,-, which is 
the solution to the saddle point problem with multiple right hand sides 
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Matrix C, represents constraints on functions "J*/, one row per each. These con- 
straints enforce continuity of approximate solution at corners and/or continuity of 
more general quantities, such as averages over shared subsets of interface (edges 
or faces) between adjacent subdomains. The local coarse matrix Kq- = *Pf = 
—A, is constructed for each subdomain. The global coarse matrix is obtained 
by the assembly procedure from local coarse matrices. This can be formally written 
as Kc = Yl^= \ Rc;KciRci> where Rq' realize the restriction of global coarse degrees 
of freedom to local coarse degrees of freedom of i-th subdomain. 

Suppose ? = g — S u is a residual within the Rrylov subspace method. The resid- 
ual assigned to z'-th subdomain is computed as r, = Ef r, where matrices of weights 
Ef distribute r to subdomains. The subdomain correction is now defined as the so- 
lution to the system 
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The residual for the coarse problem is constructed using the coarse basis functions 
subdomain by subdomain and assembling the contributions as = J^Lj R^^Pf Eff . 
The coarse correction is defined as the solution to problem KcZc = rc. Both cor- 
rections are finally added together and averaged on the interface by matrices E, to 
produce the preconditioned residual z = J^=i E, (WiRcfiC + z i) ■ 

In the three-level BDDC method [8|, the matrix is not constructed on the 
second level. Instead, subdomains from the basic (first) level are grouped into sub- 
domains on the next (second) level in the same way as elements of the original mesh 
are grouped into subdomains of the first level. The whole procedure described in this 
section is now repeated for the second level and thus the final coarse problem repre- 
sents the third level. Obviously, this can be repeated again in the multilevel BDDC 
method. The only important difference between the first and the higher levels is the 
additional interior pre-correction and post-correction applied on higher levels in 
order to approximate the whole vector of coarse solution on the lower level. 

According to [4 |, the condition number of the operator preconditioned by multi- 
level BDDC with L levels satisfies k(M bddc S) < U^Zi c e (l +log ^) , where 
Hi is the characteristic size of subdomain on level £, and Ho = h is the characteristic 
size of element. Index I is used here and throughout the next section to denote par- 
ticular level. Due to the product present in this bound, each additional level worsens 



4 



Jakub Sfstek et al. 



the mathematical efficiency of the multilevel preconditioned The proof of the con- 
dition number bound as well as details of the algorithm of multilevel BDDC can be 
found in |4). 



3 Parallel implementation 

Our implementation of the multilevel BDDC method has been recently released 
as an open-source solver library BDDCMLQ It is written in Fortran 95 program- 
ming language and parallelized by MPI. The solver relies on the sparse direct solver 
MUMPS — a serial instance is used for each subdomain problem and a parallel 
instance is called for the final coarse problem. The solver supports assignment of 
several subdomains to each processor, since it is often useful to create divisions 
independently of number of available processors. A division of the mesh into sub- 
domains on the first level is either provided to the solver by user's application or 
created internally by ParMETIS. The METIS package is currently used for this pur- 
pose on higher levels. 

Similarly to other related preconditi oners, we first need to set-up the multilevel 
BDDC preconditioner, which is then applied in each iteration of the Krylov sub- 
space method. Details of the set-up are given in Algorithm [T] while key operations 
of each application are summarized in Algorithm [2] In these descriptions, we pro- 
vide comments on how the steps are implemented in BDDCML in parentheses. 



Algorithm 1 Set-up of BDDC preconditioner with L levels 

1: for levels = 1,...,L- 1 do 
2: iff >1 then 

3: build pseudo-mesh: subdomains — > 'elements'; comers + edges + faces — > 'nodes' 

4: end if 

5: divide pseudo-mesh into subdomains (by METIS for i > 1, or by ParMETIS for £ = 1) 

6: classify interface into/ace.s, edges, vertices 

7: select corners (using face-based algorithm from |6|) 

8: assemble matrices of subdomains K' (use MPI to collect them on assigned cores) 
9: prepare interior correction - factorize interior block of K' (serial MUMPS) 
10: factorize the matrices of local saddle point problems (fT) (serial MUMPS) 
1 1 : find coarse basis functions 1*' and coarse matrices K^T = — Aj from jljl (serial MUMPS) 

12: end for 

13: factorize global coarse matrix K^T 1 = Y^-i ( R L-l )r K i-l R t-i ( para ii e l MUMPS) 



http : / /www . math .cas.cz/ $ \sim$ sistek/ sof tware/bddcml . html 
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Algorithm 2 Application of BDDC preconditioner with L levels 

for level 1= 1,...,L- 1 do 
iff > 1 then 

compute interior pre-correction of residual r* (serial MUMPS) 
end if 

distribute residual among subdomains rf = (Ef ) T T- 
determine subdomain corrections z[ from (|2l (serial MUMPS) 
construct coarse residual r e c = J^, (R^.) r (1 / f) 7 '(Ef J 7 "?* (collective MPI) 
end for 

solve the coarse problem K^ 1 z^T 1 = r£f ] (parallel MUMPS) 
forleveH = L- do 
iti<L- 1 then 

z< t <-z e + l 
end if 

combine coarse correction and subdomain corrections zf = E- ('P-'R^.Zp - 
if i > 1 then 

apply interior post-correction to (serial MUMPS) 
end if 
end for 



4 Numerical results 

The first example corresponds to a problem of mechanical analysis of a cubic sam- 
ple of geocomposite and was analyzed in The length of the edge of the cube 
is 75 mm. The cube comprises five distinct materials identified by means of com- 
puter tomography (Fig.[T]left), which causes anisotropic response of the cube even 
for simple axial stretching in z direction (Fig. [T] right). The problem is discretized 
using unstructured grid of about 12 million linear tetrahedral elements, resulting 
in approximately 6 million unknowns. The mesh was divided into 1024, 128, and 
16 subdomains on the first, second and third level, respectively, and the respective 
coarse problems (using corners and arithmetic averages on all edges and faces) con- 
tain 86,094, 11,265, and 612 unknowns. 

Table [T] summarizes the efficiency of the multilevel preconditioner by means of 
the resulting condition number (estimated from the tridiagonal matrix generated dur- 
ing iterations of preconditioned conjugate gradient (PCG) method) and number of 
iterations. The iterations were stopped when the relative residual ||r||/||g|| decreased 
bellow 10~ 6 . This table confirms the predicted worsening of the condition number 
with each additional level expected from the condition number bound. 

Table|2]contains a strong scaling test using different number of levels. We differ- 
entiate the time spent on set-up and in PCG. All these computations were performed 
on the IBM SP6 computer at CINECA, Bologna. The computer is based on IBM 
Power6 4.7 GHz processors with 4 GB of RAM per core. 

We can conclude from Tab.[2]that while adding levels seems not to be feasible for 
small number of cores (the computational time stagnates or even grows), it improves 
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the scaling on many cores. The minimal overall solution time is achieved for four 
levels and largest number of cores, despite the largest number of required iterations. 




Fig. 1 Geocomposite problem: slices through material distribution (left) and displacement field 
(right). 



Table 1 Condition number and number of iterations for different number of levels. 



num. of levels 


num. of subs. 


cond. num. 


num. of PCG its. 


2 


1024/1 


50 


46 


3 


1024/128/1 


79 


56 


4 


1024/128/16/1 


568 


131 



Table 2 Strong scaling for geocomposite problem using two, three, and four levels. 



number of processors 


64 128 256 512 


1024 


2 levels 






BDDC set-up time (s) 


61.0 37.7 25.7 23.2 


39.5 


PCG time (s) 


22.3 19.9 27.8 44.9 


97.5 


3 levels 






BDDC set-up time (s) 


49.5 29.0 18.4 12.6 


11.0 


PCG time (s) 


28.5 22.6 16.7 14.7 


13.2 


4 levels 






BDDC set-up time (s) 


49.4 28.6 17.8 12.3 


9.1 


PCG time (s) 


60.6 33.2 21.2 15.4 


11.8 



Our second example is a problem of Stokes flow in a 3D lid driven cavity. We 
use the set-up suggested in [9 |: zero velocity is prescribed on all faces of the [0, l] 3 
cube except the face for z = 1, where unit velocity vector u = [1/V3, y / 2/3,0] is 
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prescribed. We have used this test case also in the recent paper |5 1, but we have not 
presented parallel results there. 

The problem is uniformly discretized using hexahedral Taylor-Hood finite ele- 
ments. Computational mesh was divided into irregular partitions using the METIS 
graph partitioner (see Fig. [2] for an example). A plot of pressure inside the cavity 
and velocity vectors is given in Fig. |2](right). 




Fig. 2 Stokes flow in lid driven cavity: example of division into 64 subdomains (left), pressure 
contours and velocity vectors in the cut in direction of the prescribed velocity [l/\/3, ^2/3,0] 
(right). 



Table [3] summarizes a weak scaling test for this problem. The sequence of prob- 
lems ranging from 1.7 million unknowns to 25.4 million unknowns are distributed 
among processors such that the size of local problems is kept approximately con- 
stant around 50 thousand unknowns. We present results by the BDDC precondi- 
tioner using two and three levels combined with BiCGstab method. For compari- 
son, we also report results of the preconditioner based on Additive Schwarz Method 
(ASM) combined with BiCGstab and using MUMPS offered in the PETSc library 
version 3.1. We report numbers of iterations and required overall computational 
times. Where 'n/a' is present in the table, the simple ASM method was unable to 
provide the solution, which was often encountered for larger problems. The compu- 
tational times were obtained on Darwin supercomputer of the University of Cam- 
bridge, using Intel Xeon 5100 3.0 GHz processors with 2 GB of RAM per core. 

We can see in Tab. [3] that number of BiCGstab iterations remains almost con- 
stant for the two-level method, while mildly growing when using three levels, being 
again larger for the latter. The computational time slightly grows with problem size 
for BDDC, both using two and three levels, but this growth is rather acceptable. 
More importantly, we can also see that for this case the benefit of using an addi- 
tional level is slightly outweighed by the overhead of the additional iterations, and 
so the computational time is not improved by using three levels. Finally, the compar- 
ison with ASM results seem to confirm the computational efficiency of our solver 
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which is comparable with the state-of-the-art PETSc library, and the rather poor 
performance of ASM in comparison to BDDC in this setting. 



Table 3 Weak scaling for Stokes flow in the cavity: additive Schwarz method (ASM), and BDDC 
using two and three levels. 









ASM 




BDDC (2 levels) 


BDDC (3 levels) 




# elms. 


# dofs. 


# cores 


# its. 


time (s) 


# its. 


time (s) 


divisions # its. 


time (s) 


40 3 


1.7M 


32 


282 


533 


18 


122 


32/4/1 22 


126 


50 3 


3.2M 


64 


396 


805 


19 


132 


64/8/1 25 


205 


64 3 


6.7M 


128 


384 


536 


21 


186 


128/16/1 30 


194 


80 3 


13. 1M 


256 


n/a 


n/a 


21 


178 


256/32/1 36 


201 


100 3 


25. 4M 


512 


n/a 


n/a 


20 


205 


512/64/1 35 


211 



5 Conclusion 

We have presented a parallel open-source implementation of the multilevel BDDC 
method. The two-level algorithm has scalability issues related to the coarse problem 
solution, mainly in the part of iterations. It can be noted that for the tested cases, 
it has not been the size of the coarse problem, but rather its fragmentation among 
too many cores which causes these issues. From our experiments, it appears that 
the multilevel preconditioner tends to scale better in both parts - set-up and Krylov 
subspace iterations. While the better scalability is able to translate into much faster 
solution for some cases, the extra overhead can also just cancel out the savings for 
other cases. It is therefore important to choose appropriate number of levels for 
a particular problem. We expect that advantages of the multilevel approach would 
pronounce further for problems divided into many (tens of thousands) of subdo- 
mains. Such challenging problems will likely become common in near future and 
will provide valuable feedback for further research in this field. 
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